Active Learning for Crowd-Sourced Databases

Authors

  • Barzan Mozafari
  • Purnamrita Sarkar
  • Michael J. Franklin
  • Michael I. Jordan
  • Samuel Madden
Abstract

Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, or sentiment analysis. However, due to the time and cost of human labor, solutions that rely solely on crowd-sourcing are often limited to small datasets (i.e., a few thousand items). This paper proposes algorithms for integrating machine learning into crowd-sourced databases in order to combine the accuracy of human labeling with the speed and cost-effectiveness of machine learning classifiers. By using active learning as our optimization strategy for labeling tasks in crowd-sourced databases, we can minimize the number of questions asked to the crowd, allowing crowd-sourced applications to scale (i.e., label much larger datasets at lower costs). Designing active learning algorithms for a crowd-sourced database poses many practical challenges: such algorithms need to be generic, scalable, and easy to use for a broad range of practitioners, even those who are not machine learning experts. We draw on the theory of nonparametric bootstrap to design, to the best of our knowledge, the first active learning algorithms that meet all these requirements. Our results, on 3 real-world datasets collected with Amazon’s Mechanical Turk, and on 15 UCI datasets, show that our methods on average ask 1–2 orders of magnitude fewer questions than the baseline, and 4.5–44× fewer than existing active learning algorithms.
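
The general idea of bootstrap-driven active learning mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the nearest-centroid classifier, the binary 0/1 labels, the p(1 - p) variance score, and all function names (train_nearest_centroid, bootstrap_uncertainty, select_questions) are assumptions made purely for illustration. The sketch resamples the labeled set with replacement, retrains a classifier on each resample, and treats items whose predicted label varies most across resamples as the most useful questions to send to the crowd.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, ...]
Classifier = Callable[[Point], int]
Trainer = Callable[[Sequence[Tuple[Point, int]]], Classifier]


def train_nearest_centroid(data: Sequence[Tuple[Point, int]]) -> Classifier:
    """Hypothetical stand-in classifier: predict the label of the closest class centroid."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y in data:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x: Point) -> int:
        # Squared Euclidean distance to each centroid; the smallest wins.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return predict


def bootstrap_uncertainty(
    labeled: Sequence[Tuple[Point, int]],
    unlabeled: Sequence[Point],
    train: Trainer,
    k: int = 50,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Score each unlabeled item by the variance of its predicted 0/1 label
    across k classifiers, each trained on a bootstrap resample of `labeled`."""
    rng = rng or random.Random(0)
    votes = [0] * len(unlabeled)
    for _ in range(k):
        # Nonparametric bootstrap: draw len(labeled) items with replacement.
        sample = [rng.choice(labeled) for _ in labeled]
        clf = train(sample)
        for i, x in enumerate(unlabeled):
            votes[i] += clf(x)
    # Variance of a Bernoulli variable with estimated mean p is p * (1 - p).
    return [(v / k) * (1 - v / k) for v in votes]


def select_questions(
    labeled: Sequence[Tuple[Point, int]],
    unlabeled: Sequence[Point],
    train: Trainer,
    budget: int,
    k: int = 50,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return indices of the `budget` unlabeled items with the highest score."""
    scores = bootstrap_uncertainty(labeled, unlabeled, train, k, rng)
    ranked = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]
```

Because the scoring function only calls train and the returned classifier, any classifier could be substituted without changing the selection logic. In this sketch the items returned by select_questions would be sent to the crowd, and the remaining items would be labeled by the classifier.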

Similar resources

Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning

Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, and sentiment analysis. However, due to the time and cost of human labor, solutions that rely solely on crowd-sourcing are often limited to small datasets (i.e., a few thousand items). This paper proposes algorithms for integrat...

Crowd-sourcing and author submission as alternatives to professional curation

Can we decrease the costs of database curation by crowd-sourcing curation work or by offloading curation to publication authors? This perspective considers the significant experience accumulated by the bioinformatics community with these two alternatives to professional curation in the last 20 years; that experience should be carefully considered when formulating new strategies for biological d...

Combining preference and absolute judgements in a crowd-sourced setting

This paper addresses the problem of obtaining gold-standard labels of objects based on subjective judgements provided by humans. Assuming each object can be associated with an underlying score, the objective of this work is to predict the underlying score efficiently and accurately based on preference and absolute judgements via experiments in a crowd-sourced setting. Unlike previous informatio...

The Crowd vs. the Lab: A Comparison of Crowd-Sourced and University Laboratory Participant Behavior

There are considerable differences in remuneration and environment between crowd-sourced workers and the traditional laboratory study participant. If crowd-sourced participants are to be used for information retrieval user studies, we need to know if and to what extent their behavior on information retrieval tasks differs from the accepted standard of laboratory participants. With both crowd-so...

Crowd-sourced

While many Learning Content Management Systems are available, the collaborative, community-based creation of rich e-learning content is still not sufficiently well supported. Few attempts have been made to apply crowd-sourcing and wiki approaches to the creation of e-learning content. In this article, we showcase SlideWiki, an Open Courseware Authoring platform supporting the crowdsourced creat...

Journal title:
  • CoRR

Volume abs/1209.3686  Issue 

Pages  -

Publication date 2012